Quantiles on Streams

نویسندگان

  • Chiranjeeb Buragohain
  • Subhash Suri
چکیده

The need to summarize data has been around since the earliest days of data processing. Large volumes of raw, unstructured data easily overwhelm human ability to comprehend or digest, and tools that help identify the major underlying trends or patterns in data have enormous value. Quantiles characterize distributions of real world data sets in ways that are less sensitive to outliers than simpler alternatives such as the mean and the variance. Consequently, quantiles are of interest to both database implementers and users: for instance, they are a fundamental tool for query optimization, splitting of data in parallel database systems, and statistical data analysis. Quantiles are closely related to the familiar concepts of frequency distributions and histograms. The cumulative frequency distribution F () is commonly used to summarize the distribution of a (totally ordered) set S. Specifically, for any value x,

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

[7] A. Asuncion and D. J. Newman. UCI Machine Learning Repository

[3] Rakesh Agrawal and Arun Swami. A one-pass space-efficient algorithm for finding quantiles. A one-pass algorithm for accurately estimating quantiles for disk-resident data. [8] Jürgen Beringer and Eyke Hüllermeier. An efficient algorithm for instance-based learning on data streams.

متن کامل

Influences of Spatial and Temporal Variation on Fish–Habitat Relationships Defined by Regression Quantiles

—We used regression quantiles to model potentially limiting relationships between the standing crop of cutthroat trout Oncorhynchus clarki and measures of stream channel morphology. Regression quantile models indicated that variation in fish density was inversely related to the width:depth ratio of streams but not to stream width or depth alone. The spatial and temporal stability of model predi...

متن کامل

Evaluation of Summarization Schemes for Learning in Streams

Traditional discretization techniques for machine learning, from examples with continuous feature spaces, are not efficient when the data is in the form of a stream from an unknown, possibly changing, distribution. We present a time-and-memory-efficient discretization technique based on computing ε-approximate exponential frequency quantiles, and prove bounds on the worst-case error introduced ...

متن کامل

A nearly optimal and deterministic summary structure for update data streams

We present a deterministic summary structure over update streams that enables deterministic and the first space-optimal algorithms for a variety of problems, including, estimating frequencies, finding approximate frequent items, finding approximate quantiles, finding hierarchical heavy hitters, approximately optimal B-bucket histograms, estimating inner product sizes, etc..

متن کامل

Exponentially Weighted Simultaneous Estimation of Several Quantiles

In this paper we propose new method for simultaneous generating multiple quantiles corresponding to given probability levels from data streams and massive data sets. This method provides a basis for development of single-pass low-storage quantile estimation algorithms, which differ in complexity, storage requirement and accuracy. We demonstrate that such algorithms may perform well even for hea...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009